Supervised Learning and Visualization

This lecture

  1. Course pages
  2. Course overview
  3. Introduction to SLV
  4. (Dark) Data Science
  5. Data Wrangling
  6. Wrap-up

Procedural stuff

  • If there is anything important - contact me!

  • The on-location lectures will not be recorded.

• If you are ill, ask your classmates to fill you in on what you missed.
  • If you feel that you are stuck, ask your classmates, ask me, ask the other lecturers. Ask a lot! Ask questions during/after the lectures and in the Q&A sessions.

    • You are most likely not the only one with that question. You are simply the bravest or the first.
    • Do not contact us via private chat or e-mail for content-related questions.
  • If you expect that you are going to miss some part(s) of the course, please notify me via a private MS-Teams message or e-mail.

Course pages

You can find all materials at the following location:

https://dgoretzko.github.io/slv/


All course materials should be submitted through a pull-request from your Fork of

LINK


The structure of your submissions should follow the corresponding repo’s README. To make it simple, I have added an example for the first practical. If you are unfamiliar with GitHub, forking, and/or pull requests, please study this exercise from another course. There you can find video walkthroughs that detail the process.

Course overview

Team

Topics

Week  Focus                                                         Teacher  Materials
1     Data wrangling with R                                         DG       R4DS, ISLR
2     The grammar of graphics                                       DG       R4DS
3     Exploratory data analysis                                     DG       R4DS, FIMD
4     Statistical learning: regression                              MC       ISLR, TBD
5     Statistical learning: classification                          EJvK     ISLR, TBD
6     Classification model evaluation                               EJvK     ISLR, TBD
7     Nonlinear models                                              MC       ISLR, TBD
8     Bagging, boosting, random forest and support vector machines  MC       ISLR, TBD

Course Setup

Each week we have the following:

  • 1 Lecture on Monday @ 9am in Dalton 500 8.27
  • 1 Practical (not graded, but must be submitted to pass). Hand in the practical before the next lecture.
  • 1 combined workgroup and Q&A session in Dalton 500 8.27
  • Course materials to study. See the corresponding week on the course page.

Twice we have:

  • Group assignments
  • The assignment is made in teams (3-4 students).
  • Each assignment counts towards 25% of the total grade. Must be > 5.5 to pass.

Once we have:

  • Individual exam
  • BYOD: so charge and bring your laptop.
  • 50% of total grade. Must be > 5.5 to pass.

Groups

We will make groups on Wednesday Sept 13!

Introduction to SLV

Terms I may use

  • TDGM: True data generating model
  • DGP: Data generating process, closely related to the TDGM, but with all the wacky additional uncertainty
  • Truth: The comparative truth that we are interested in
  • Bias: The distance to the comparative truth
  • Variance: When not everything is the same
  • Estimate: Something that we calculate or guess
  • Estimand: The thing we aim to estimate or guess
  • Population: That larger entity without sampling variance
  • Sample: The smaller thing with sampling variance
  • Incomplete: There exists a more complete version, but we don’t have it
  • Observed: What we have
  • Unobserved: What we would also like to have

Some statistics

At the start

We begin this course series with a bit of statistical inference.

Statistical inference is the process of drawing conclusions from truths

Truths are boring, but they are convenient.

  • however, for most problems truths require a lot of calculations, tallying or a complete census.
  • therefore, a proxy of the truth is in most cases sufficient
  • An example for such a proxy is a sample
  • Samples are widely used and have been for a long time. See Jelke Bethlehem’s CBS discussion paper for an overview of the history of sampling within survey statistics

Being wrong about the truth

  • The population is the truth
  • The sample comes from the population, but is generally smaller in size
  • This means that not all cases from the population can be in our sample
  • If not all information from the population is in the sample, then our sample may be wrong


    Q1: Why is it important that our sample is not wrong?
    Q2: How do we know that our sample is not wrong?

Solving the missingness problem

  • There are many flavours of sampling
  • If we give every unit in the population the same probability of being sampled, we are doing random sampling
  • The convenience with random sampling is that the missingness problem can be ignored
  • The missingness problem would in this case be: not every unit in the population has been observed in the sample
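The idea can be checked in a small simulation. A minimal sketch (in Python; the course itself uses R), with made-up population values: under simple random sampling, the sample mean lands close to the population mean, which is why the sampling-induced missingness can be ignored.

```python
# Sketch: simple random sampling from a known (simulated) population.
# All numbers are illustrative.
import random

random.seed(42)
population = [random.gauss(50, 10) for _ in range(100_000)]
pop_mean = sum(population) / len(population)

# Every unit has the same probability of ending up in the sample
sample = random.sample(population, 500)
sample_mean = sum(sample) / len(sample)

print(f"population mean: {pop_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")  # close, but not identical
```

The sample mean varies from draw to draw, but it is not systematically off: that is what random sampling buys us.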




Q3: Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?

Sidestep

  • The problem is a bit larger

  • We have three entities at play, here:

    1. The truth we’re interested in
    2. The proxy that we have (e.g. sample)
    3. The model that we’re running
  • The more features we use, the more we capture about the outcome for the cases in the data

  • The more cases we have, the more we approach the true information


    All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.



Core assumption: all observations are bona fide

Uncertainty simplified

When we do not have all information …

  1. We need to accept that we are probably wrong
  2. We just have to quantify how wrong we are


In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.

The uncertainty measures about our estimates can be used to create intervals

Rumsfeld moment of fame in statistics

Confidence intervals

Confidence intervals can be hugely informative!

If we draw 100 samples from a population, a 95% CI will cover the population value on average in 95 out of 100 samples.

  • If the coverage is < 95%: bad estimation process with risk of errors and invalid inference
  • If the coverage is > 95%: inefficient estimation process, but correct conclusions and valid inference. Lower statistical power.
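This coverage interpretation can be verified by simulation. A sketch (in Python; the course uses R), with illustrative numbers: draw many samples from a known population and count how often the 95% CI for the mean contains the true value.

```python
# Sketch: empirical coverage of a 95% CI for the mean,
# using the normal approximation. Numbers are illustrative.
import random
import statistics

random.seed(1)
true_mean, true_sd, n = 0.0, 1.0, 50
reps = 1000
covered = 0
for _ in range(reps):
    sample = [random.gauss(true_mean, true_sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n**0.5
    lo, hi = m - 1.96 * se, m + 1.96 * se  # normal approximation
    covered += lo <= true_mean <= hi

print(f"coverage: {covered / reps:.3f}")  # close to 0.95
```

The normal approximation is slightly too narrow for small n; a t-based critical value would nudge the coverage up toward the nominal 95%.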

The other type of intervals

Prediction intervals can also be hugely informative!

Prediction intervals are generally wider than confidence intervals

  • This is because they cover the inherent uncertainty in the data point on top of the sampling uncertainty
  • Just like CIs, PIs become narrower at locations where more information is observed (less uncertainty)
  • Usually this is at the location of the mean of the predicted values.


Narrower intervals mean less uncertainty. It does not mean less bias!
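The width difference is easy to see in the simplest case: a normal sample. A sketch (in Python; illustrative numbers): the CI for the mean shrinks with the standard error s/√n, while the PI for a new observation must also absorb the noise of that single new data point, so its half-width uses s·√(1 + 1/n).

```python
# Sketch: 95% CI for the mean vs. 95% PI for a new observation,
# for a single normal sample. Normal approximation throughout.
import random
import statistics

random.seed(7)
n = 40
sample = [random.gauss(10, 2) for _ in range(n)]
m, s = statistics.mean(sample), statistics.stdev(sample)

ci_half = 1.96 * s / n**0.5            # confidence interval half-width
pi_half = 1.96 * s * (1 + 1 / n)**0.5  # prediction interval half-width

print(f"CI: ({m - ci_half:.2f}, {m + ci_half:.2f})")
print(f"PI: ({m - pi_half:.2f}, {m + pi_half:.2f})")
```

Note that as n grows the CI shrinks toward zero width, but the PI never shrinks below ±1.96·s: the inherent noise in a single new observation does not go away with more data.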

The holy trinity

Whenever I evaluate something, I tend to look at three things:

  • bias (how far from the truth)
  • uncertainty/variance (how wide is my interval)
  • coverage (how often do I cover the truth with my interval)


As a function of model complexity in specific modeling efforts, these components play a role in the bias/variance tradeoff.
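All three can be estimated in one simulation loop. A sketch (in Python; illustrative numbers), evaluating the sample mean as an estimator of a known population mean:

```python
# Sketch: bias, variance, and coverage of the sample mean,
# estimated over many repeated samples. Numbers are illustrative.
import random
import statistics

random.seed(3)
truth, sd, n, reps = 5.0, 2.0, 30, 2000
estimates, covered = [], 0
for _ in range(reps):
    sample = [random.gauss(truth, sd) for _ in range(n)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / n**0.5
    covered += (m - 1.96 * se) <= truth <= (m + 1.96 * se)
    estimates.append(m)

bias = statistics.mean(estimates) - truth  # how far from the truth
variance = statistics.variance(estimates)  # how wide the estimates spread
coverage = covered / reps                  # how often the interval hits
print(f"bias={bias:.3f}  variance={variance:.3f}  coverage={coverage:.3f}")
```

For an unbiased estimator like this one, the bias hovers near zero and the coverage near 95%; a biased estimator would show coverage dropping below the nominal level.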

Now with missingness

We now have a new problem:

  • we do not have the whole truth; but merely a sample of the truth
  • we do not even have the whole sample, but merely a sample of the sample of the truth.


The statistical solution

There are two sources of uncertainty that we need to cover:

  1. Uncertainty about the missing value:
    when we don’t know what the true observed value should be, we must create a distribution of values with proper variance (uncertainty).
  2. Uncertainty about the sampling:
    nothing can guarantee that our sample is the one true sample, so it is reasonable to assume that the parameters obtained on our sample are biased.


This becomes more challenging if the sample does not come randomly from the population, or if the feature set is too limited to fit the substantive model of interest.
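One standard way to combine these two sources is Rubin's pooling rules for multiple imputation: the within-imputation variance captures the sampling uncertainty, the between-imputation variance captures the uncertainty about the missing values. A sketch (in Python; all estimates below are hypothetical, not from real data):

```python
# Sketch of Rubin's rules: pool an estimate over m completed
# (imputed) data sets. All numbers are made up for illustration.
import statistics

# Hypothetical: the same mean estimated on m = 5 completed data sets
estimates = [10.2, 9.8, 10.5, 10.1, 9.9]
within_vars = [0.40, 0.38, 0.42, 0.41, 0.39]  # per-imputation variances

m = len(estimates)
qbar = statistics.mean(estimates)     # pooled estimate
W = statistics.mean(within_vars)      # within-imputation (sampling) variance
B = statistics.variance(estimates)    # between-imputation (missingness) variance
T = W + (1 + 1 / m) * B               # total variance

print(f"pooled estimate: {qbar:.2f}, total variance: {T:.3f}")
```

The total variance T is always at least the within-imputation variance W: ignoring the missingness uncertainty would make our intervals too narrow.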

Now how do we know we did well?

I’m really sorry, but:
We don’t. In practice we may often lack the necessary comparative truths!

For example:

  1. Predict a future response, but we only have the past
  2. Analyzing incomplete data without a reference about the truth
  3. Estimate the effect between two things that can never occur together
  4. Observe rather selective instances from the population

Bringing it in perspective

Focus points

  1. What are statistical learning and visualization?
  2. How does it connect to data analysis?
  3. Why do we need the above?
  4. What types of analyses and learning are there?

Some example questions

  • Did our imputations make sense?
  • Who will win the election?
  • Is the climate changing?
  • Why are women underrepresented in STEM degrees?
  • What is the best way to prevent heart failure?
  • Who is at risk of crushing debt?
  • Is this matter undergoing a phase transition?
  • What kind of topics are popular on Twitter?
  • How familiar are incoming DAV students with several DAV topics?

Goals in data analysis

  • Description:
    What happened?
  • Prediction:
    What will happen?
  • Explanation:
    Why did/does something happen?
  • Prescription:
    What should we do?

Modes in data analysis

  • Exploratory:
    Mining for interesting patterns or results
  • Confirmatory:
    Testing hypotheses

Some examples

              Exploratory                 Confirmatory
Description   EDA; unsupervised learning  Correlation analysis
Prediction    Supervised learning         Theoretical modeling
Explanation   Visual mining               Causal inference
Prescription  Personalised medicine       A/B testing

In this course

  • Exploratory Data Analysis:
    Describing interesting patterns: use graphs and summaries to understand subgroups, detect anomalies, and understand the data
    Examples: boxplot, five-number summary, histograms, missing data plots, …

  • Supervised learning:
    Regression: predict continuous labels from other values.
    Examples: linear regression, generalized additive model, regression trees,…
    Classification: predict discrete labels from other values.
    Examples: logistic regression, support vector machines, classification trees, …
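As a taste of the EDA tools above, here is a minimal sketch (in Python; in the course you would use R) of the five-number summary behind a boxplot, with made-up data:

```python
# Sketch: five-number summary (min, Q1, median, Q3, max) and
# Tukey's outlier rule, on a tiny made-up data set.
import statistics

x = [2, 3, 3, 4, 5, 5, 6, 7, 8, 9, 20]  # 20 is a potential outlier

q1, q2, q3 = statistics.quantiles(x, n=4)  # quartiles
five_number = (min(x), q1, q2, q3, max(x))
print(five_number)

# Tukey's rule flags points beyond 1.5 * IQR from the quartiles
iqr = q3 - q1
outliers = [v for v in x if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(outliers)  # [20]
```

This is exactly the information a boxplot draws: the box spans Q1 to Q3, the line marks the median, and flagged points are plotted individually.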



Exploratory Data Analysis workflow

Data analysis

How do you think that data analysis relates to:

  • “Data analytics”?
  • “Data modeling”?
  • “Machine learning”?
  • “Statistical learning”?
  • “Statistics”?
  • “Data science”?
  • “Data mining”?
  • “Knowledge discovery”?

Explanation

People from different fields (such as statistics, computer science, information science, industry) have different goals and different standard approaches.

  • We often use the same techniques.
  • We just use different terms to highlight different aspects of so-called data analysis.
  • The terms on the previous slide are not exact synonyms.
  • But according to most people they carry the same analytic intentions.

In this course we emphasize drawing insights that help us understand the data.

Some examples

Space Shuttle Challenger

36 years ago, on 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the space shuttle Challenger experienced an enormous fireball caused by one of its two booster rockets and broke up. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.

How wages differ

The origin of cholera

Predicting the outcome of elections

Google Flu Trends

Identifying Brontës from Austen

Nothing happened, so we ignored it

In the decision process that led to the unfortunate launch of the Space Shuttle Challenger, some dark data existed.

Dark data is information that is not available.

Such unavailable information can mislead people. The notion that we could potentially be misled is important, because we then need to accept that our outcome analysis or decision process might be faulty.

If you do not have all information, there is always a possibility that you arrive at an invalid conclusion or a wrong decision.

From data collection to output

Why analysis and visualization?

  • When high-risk decisions are at hand, it is paramount to analyze the correct data.

  • When thinking about important topics, such as whether to stay in school, it helps to know that more highly educated people tend to earn more, but also that there is no difference for top earners.

  • Before John Snow, people thought “miasma” (some form of bad air) caused cholera, and they fought it by airing out the house. It was not clear whether this helped or not, but people thought it must, because “miasma” theory said so.

  • Election polls vary randomly from day to day. Before aggregating services like Peilingwijzer, newspapers would make huge news items based on noise from opinion polls.

  • If we know the flu is coming two weeks earlier than usual, that is just enough time to buy shots for the most vulnerable people (but be aware: changing data conditions).

  • If we know how ecosystems are affected by temperature change, we know how our forests will change in the coming 50-100 years due to climate change.

There is a need

The examples have in common that data analysis and the accompanying visualizations have yielded insights and solved problems that could not be solved without them.

  • On some level, humans do nothing but analyze data;
  • They may not do it consistently, understandably, transparently, or correctly, however;
  • DAV help us process more data, and can keep us honest;
  • DAV can also exacerbate our biases when we are not careful.

Thought

Data wrangling

Wrangling in the pipeline

Data wrangling is the process of transforming and mapping data from one “raw” data form into another format.

  • The process is often iterative
  • The goal is to add purpose and value to the data in order to maximize the downstream analytical gains

Source: R4DS

Core ideas

  • Discovering: The first step of data wrangling is to gain a better understanding of the data: different data are worked with and organized in different ways.
  • Structuring: The next step is to organize the data. Raw data are typically unorganized, and much of them may not be useful for the end product. This step is important for easier computation and analysis in later steps.
  • Cleaning: Cleaning takes many forms: for example, catching dates formatted in different ways, removing outliers that would skew results, or formatting null values. This step is important for assuring the overall quality of the data.
  • Enriching: At this step, determine whether additional data that could easily be added would benefit the data set.
  • Validating: This step is similar to structuring and cleaning. Use repetitive sequences of validation rules to assure data consistency as well as quality and security. An example of a validation rule is confirming the accuracy of fields by cross-checking data.
  • Publishing: Prepare the data set for use downstream, which could include use by people or software. Be sure to document any steps and logic applied during wrangling.

Source: Trifacta
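The steps above can be sketched on a tiny toy data set. A minimal sketch (in Python with only the standard library; in this course you would use R and the tidyverse instead), where all field names and validation rules are made up for illustration:

```python
# Sketch: discover -> structure -> clean -> validate on toy CSV data.
# Fields and rules are hypothetical.
import csv
import io
from datetime import datetime

raw = io.StringIO(
    "name,joined,score\n"
    "anna,2023-01-15,7.5\n"
    "ben,15/02/2023,8.0\n"   # cleaning: a differently formatted date
    "cara,2023-03-01,\n"     # cleaning: a missing value
)

rows = list(csv.DictReader(raw))  # discovering / structuring

def parse_date(s):
    # cleaning: normalize the two observed date formats
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None

for r in rows:
    r["joined"] = parse_date(r["joined"])
    r["score"] = float(r["score"]) if r["score"] else None

# validating: scores must be on a 0-10 scale when present
assert all(r["score"] is None or 0 <= r["score"] <= 10 for r in rows)

print(rows[1]["joined"])  # 2023-02-15
```

Note how the missing score is kept as an explicit null rather than silently dropped: that choice feeds directly into the missingness discussion earlier in this lecture.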

To Do